-
Notifications
You must be signed in to change notification settings - Fork 1k
Add merge and merge_n algorithms
#8753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The optimisation work that was done in #8653 would make sense here as well. That has not been done yet. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pepijnve -- what do you think about also adding benchmarks to this kernel (so that future optimizations work better)
Good idea. I’m happy to continue working on this one. I created the PR already to get the ball rolling and solicit input from other devs. |
While looking into this I realised that |
@alamb I duplicated the microbenchmark for zip as a quick fix. Is it worth trying to actually share the sets of input data and masks? If so, where should I move that code? |
| /// | ||
| /// ``` | ||
| pub fn merge_n(values: &[&dyn Array], indices: &[impl MergeIndex]) -> Result<ArrayRef, ArrowError> { | ||
| let data_type = values[0].data_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no check for empty values array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check added along with unit tests
arrow-select/src/merge.rs
Outdated
| let falsy = falsy_array.to_data(); | ||
| let truthy = truthy_array.to_data(); | ||
|
|
||
| let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, truthy.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, truthy.len()); | |
| let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, mask.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
arrow-select/src/merge.rs
Outdated
| /// Long spans of null values are also especially cheap because they do not need to be represented | ||
| /// in an input array. | ||
| /// | ||
| /// # Safety |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// # Safety | |
| /// # Panics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
arrow-select/src/merge.rs
Outdated
| use arrow_data::transform::MutableArrayData; | ||
| use arrow_schema::ArrowError; | ||
|
|
||
| /// An index for the [merge] function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// An index for the [merge] function. | |
| /// An index for the [merge_n] function. |
| &mut group, | ||
| &masks, | ||
| &array_1_10pct_nulls, | ||
| &non_null_scalar_1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The arguments here look exactly the same as for array_vs_non_null_scalar above. I think the last two arguments should be swapped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I had copied these from zip_kernel.rs which has the same mistake. Fixing here and in zip_kernel.rs.
|
There's a failing test case, but I can't say I see the relationship with this change. |
Which issue does this PR close?
Rationale for this change
The algorithms suggested in this PR originate from the
caselogic in DataFusion (see datafusion#18152 and datafusion#18444). I think it might be useful to move them toarrow-rsinstead of being tucked away in a corner of the DataFusion codebase.What changes are included in this PR?
Adds a two-way and n-way merge algorithm that's halfway between
zipandinterleave. In contrast tozipthe truthy and falsy arrays do not need to be prealigned. In contrast tointerleavethe relative order of elements in each input array is retained in the final result.Are these changes tested?
I've already added two minimal unit tests, more should probably be added.
Are there any user-facing changes?
No breaking API changes